# import all packages and set plots to be embedded inline
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
# load in the dataset into a pandas dataframe, print statistics
PLD = pd.read_csv('prosperLoanData.csv', dtype = {'LoanStatus' : str})
PLD.head(5)
# high-level overview of data shape and composition
print(PLD.shape)
print(PLD.dtypes)
print(PLD.head(10))
PLD.nunique()
# category order for each variable; ProsperRating (Alpha) runs from worst (HR) to best (AA)
ordinal_var_dict = {'LoanStatus': ['FinalPaymentInProgress', 'Current', 'Completed', 'Defaulted', 'Chargedoff',
                                   'Past Due (61-90 days)', 'Past Due (31-60 days)', 'Past Due (>120 days)',
                                   'Past Due (91-120 days)', 'Past Due (16-30 days)'],
                    'ProsperRating (Alpha)': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
                    'EmploymentStatus': ['Self-employed', 'Employed', 'Not available', 'Full-time', 'Other',
                                         'Retired', 'Part-time']}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    PLD[var] = PLD[var].astype(ordered_var)
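The conversion above relies on pandas ordered categoricals. The following is a minimal sketch of how they behave, using hypothetical toy ratings rather than values from prosperLoanData.csv, and assuming the categories run from worst (HR) to best (AA):

```python
import pandas as pd

# Hypothetical toy ratings, not taken from prosperLoanData.csv.
ratings = pd.Series(['C', 'AA', 'HR', 'B', 'C'])

# Assumed order: worst (HR) to best (AA).
rating_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']
rating_dtype = pd.api.types.CategoricalDtype(categories=rating_order, ordered=True)
ratings_cat = ratings.astype(rating_dtype)

# Ordered categoricals support comparisons and sort by category order.
better_than_c = (ratings_cat > 'C').tolist()
sorted_ratings = ratings_cat.sort_values().tolist()

# Values that are missing from the category list silently become NaN.
unknown = pd.Series(['A', 'ZZ']).astype(rating_dtype)
n_missing = int(unknown.isna().sum())
```

One consequence worth keeping in mind: any status or rating string that is not listed in the category list is converted to NaN without a warning, so it may be worth checking the null counts after the conversion.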
PLD[PLD.isnull().any(axis=1)].count()
print(PLD.describe())
PLD.head(0)
Because Prosper has only been using its own proprietary Prosper Rating since 2009, many rows are missing a ProsperRating value. Let's drop the rows with missing ratings:
PLD = PLD.dropna(subset=['ProsperRating (Alpha)']).reset_index()
PLD['LoanOriginationDate'] = pd.to_datetime(PLD['LoanOriginationDate'])
PLD.info()
When a borrower has no previous credit history on Prosper, the value in the TotalProsperLoans column is NaN. Let's replace it with 0.
PLD['TotalProsperLoans'] = PLD['TotalProsperLoans'].fillna(0)
PLD.head(5)
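The two cleaning steps above (dropping rows without a rating, then filling missing loan counts with 0) can be illustrated on a small frame. This is a sketch on hypothetical toy data; only the column names come from the dataset, and `drop=True` is an optional choice made here so that the old index is not kept as an extra column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['A', np.nan, 'C', np.nan],
    'TotalProsperLoans': [np.nan, 2.0, 1.0, np.nan],
})

# Keep only rows that have a rating, then renumber the rows from 0.
cleaned = toy.dropna(subset=['ProsperRating (Alpha)']).reset_index(drop=True)

# A missing loan count is treated as "no previous Prosper loans".
cleaned['TotalProsperLoans'] = cleaned['TotalProsperLoans'].fillna(0)
```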
What is the structure of your dataset?
There are 113,937 loans in the dataset with 82 features (ListingKey, ListingNumber, ListingCreationDate, CreditGrade, Term, LoanStatus, ClosedDate, BorrowerAPR, BorrowerRate, etc.). Most variables (61 columns) are numeric in nature, but EmploymentStatus, LoanStatus, ProsperRating (Alpha), and BorrowerState are categorical; the first three are stored as ordered factor variables.
I'm most interested in figuring out which features best predict the success rate of Prosper loans in the dataset.
I expect that borrower APR will have the strongest effect on the success rate of Prosper loans. Other features that I expect to matter are the length of the loan in months (Term), the borrower's interest rate for the loan (BorrowerRate), and LoanOriginalAmount.
I'll start by looking at the distribution of the main variable of interest: the success rate of Prosper loans.
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
binsize = 0.01
bins = np.arange(0, PLD['BorrowerRate'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'BorrowerRate', bins = bins)
plt.xlabel('BorrowerRate')
plt.show()
The distribution of the borrower rate appears to be bimodal, with a first peak around 0.16 and a larger peak (the true mode) around 0.32. Let's check the number of occurrences:
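One way to check the number of occurrences is `value_counts`, which tallies each distinct rate. The sketch below uses hypothetical rates, not values read from the dataset; on the real data the same call would be applied to `PLD['BorrowerRate']`:

```python
import pandas as pd

# Hypothetical borrower rates, for illustration only.
rates = pd.Series([0.16, 0.16, 0.3177, 0.3177, 0.3177, 0.25])

# value_counts sorts by frequency, so the first entry is the mode.
top_rates = rates.value_counts().head(2)
most_common_rate = top_rates.index[0]
```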
binsize = 20
bins = np.arange(0, PLD['Term'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'Term', bins = bins)
plt.xticks([12 ,36 ,60])
plt.xlabel('Term')
plt.show()
There are three loan terms: 12, 36 and 60 months. The most common is 36 months.
binsize = 0.025
bins = np.arange(0, PLD['BorrowerAPR'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'BorrowerAPR', bins = bins)
plt.xlabel('BorrowerAPR')
plt.show()
The APR distribution peaks at around 0.2.
binsize = 1000
bins = np.arange(0, PLD['LoanOriginalAmount'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'LoanOriginalAmount', bins = bins)
plt.xlabel('LoanOriginalAmount')
plt.show()
Most loans are for less than 10,000.
# let's plot all four together to get an idea of each categorical variable's distribution.
fig, ax = plt.subplots(nrows=4, figsize = [30,40])
default_color = sb.color_palette()[0]
sb.countplot(data = PLD, x = 'LoanStatus', color = default_color, ax = ax[0] )
#plt.xticks(rotation=50)
sb.countplot(data = PLD, x = 'ProsperRating (Alpha)', color = default_color, ax = ax[1])
sb.countplot(data = PLD, x = 'EmploymentStatus', color = default_color, ax = ax[2])
sb.countplot(data = PLD, x = 'CurrentlyInGroup', color = default_color, ax = ax[3])
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=40)
plt.show()
sb.countplot(data=PLD, x='ListingCategory (numeric)', color=default_color)
The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
The overwhelming majority of loans are used for debt consolidation. Other notable categories include Other, Auto, Home Improvement and Business.
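To make the category plot easier to read, the numeric codes can be mapped to their labels before counting. This is a minimal sketch: the labels are taken from the list above (only a subset is shown), while the codes in `codes` are hypothetical toy values.

```python
import pandas as pd

# Subset of the code-to-label mapping listed above.
listing_labels = {
    0: 'Not Available',
    1: 'Debt Consolidation',
    2: 'Home Improvement',
    3: 'Business',
    6: 'Auto',
    7: 'Other',
}

# Hypothetical listing codes, for illustration only.
codes = pd.Series([1, 1, 7, 2, 1, 6], name='ListingCategory (numeric)')

# Translate codes to labels; codes absent from the mapping would become NaN.
labels = codes.map(listing_labels)
label_counts = labels.value_counts()
```

On the full data, the mapped column could then be passed to `sb.countplot` in place of the numeric one.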
sb.countplot(data=PLD, x='Recommendations', color=default_color);
The overwhelming majority of loans were obtained without recommendations.
plt.figure(figsize = [10,8])
default_color = sb.color_palette()[0]
sb.countplot(data = PLD, x = 'ProsperScore', color = default_color )
plt.show()
ProsperScore: a custom risk score built using historical Prosper data. The score ranges from 1 to 10, with 10 being the best, or lowest-risk, score. It applies to loans originated after July 2009. The scores are approximately normally distributed.
Prosper ratings are almost normally distributed, and so are Prosper scores. The distribution of borrower APR looks multimodal, with most values in the range 0.05 to 0.4. There are no unusual points and no need to perform any transformations.
The count plots are roughly normal in shape for the most part, so no further operations are needed.
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
numeric_vars = ['Term', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'StatedMonthlyIncome','LoanOriginalAmount',
'ProsperScore']
categoric_vars = ['LoanStatus', 'ProsperRating (Alpha)', 'EmploymentStatus', 'IsBorrowerHomeowner', 'CurrentlyInGroup']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(PLD[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
BorrowerAPR and ProsperScore are negatively correlated because borrowers with a lower score are more likely to pay a higher APR. Similarly, a higher credit score means the borrower is more trustworthy and therefore receives a lower APR.
# plot matrix: sample 500 loans so that plots are clearer and
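For reference, each cell of the heatmap is a Pearson correlation coefficient. The sketch below computes one by hand and compares it with `Series.corr`; the score and APR values are hypothetical and chosen only so that APR falls as the score rises.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and APRs, for illustration only.
toy = pd.DataFrame({
    'ProsperScore': [2, 4, 6, 8, 10],
    'BorrowerAPR': [0.35, 0.30, 0.22, 0.15, 0.08],
})

# pandas computes the Pearson coefficient by default.
r_pandas = toy['ProsperScore'].corr(toy['BorrowerAPR'])

# The same coefficient from its definition: the sum of products of
# deviations, divided by the root of the product of summed squares.
x = toy['ProsperScore'] - toy['ProsperScore'].mean()
y = toy['BorrowerAPR'] - toy['BorrowerAPR'].mean()
r_manual = float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```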
# they render faster
samples = np.random.choice(PLD.shape[0], 500, replace = False)
PLD_samp = PLD.loc[samples,:]
g = sb.PairGrid(data = PLD_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
Matrix plot: As with the correlation plot, we can tell which pairs have negative or positive relationships from the pattern in each scatter plot. ProsperScore seems to be more closely related to BorrowerAPR than the other variables are. StatedMonthlyIncome does not give useful information about BorrowerAPR and will not be analyzed further.
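Because the sample above is drawn without a seed, the matrix plot changes on every run. One possible alternative, sketched here on a hypothetical toy frame, is `DataFrame.sample` with a fixed `random_state` (the value 42 is arbitrary), which returns the same rows each time:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the loan data.
toy = pd.DataFrame({'BorrowerAPR': np.linspace(0.05, 0.40, 1000)})

# Sampling without replacement; a fixed random_state makes it repeatable.
sample_a = toy.sample(n=500, random_state=42)
sample_b = toy.sample(n=500, random_state=42)
```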
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
samples = np.random.choice(PLD.shape[0], 2000, replace = False)
PLD_samp = PLD.loc[samples,:]
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sb.boxplot(x = x, y = y, color = default_color)
plt.figure(figsize = [10, 10])
# `size` was renamed to `height` in seaborn 0.9
g = sb.PairGrid(data = PLD_samp, y_vars = numeric_vars , x_vars = categoric_vars,
                height = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
The figure shows that the loan amount increases with the loan term. The borrower APR decreases as the rating improves, and borrowers with the best Prosper ratings have the lowest APR, which suggests that the Prosper rating has a strong effect on borrower APR. Borrowers with better ratings also have a larger monthly income and loan amount. Employed, self-employed and full-time borrowers have a higher monthly income and loan amount than part-time, retired and not employed borrowers.
# since there are only four subplots to create, using the full data should be fine.
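The visual impression from the box plots can be backed by a per-rating summary. The sketch below computes the median APR per rating on hypothetical toy rows; the rating order passed to `reindex` is an assumption (worst to best), and only three ratings are included to keep it short.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR', 'HR', 'C', 'C', 'AA', 'AA'],
    'BorrowerAPR': [0.36, 0.34, 0.22, 0.24, 0.08, 0.10],
})

# Assumed order from worst to best rating.
rating_order = ['HR', 'C', 'AA']

# Median APR per rating, arranged from worst to best.
median_apr = (toy.groupby('ProsperRating (Alpha)')['BorrowerAPR']
                 .median()
                 .reindex(rating_order))
```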
plt.figure(figsize = [15, 15])
x =['LoanStatus','ProsperRating (Alpha)','EmploymentStatus' ,'ProsperRating (Alpha)','IsBorrowerHomeowner','CurrentlyInGroup']
# subplot 1: LoanStatus vs EmploymentStatus
plt.subplot(4, 1, 1)
sb.countplot(data = PLD, x = 'EmploymentStatus', hue = 'LoanStatus', palette = 'Blues')
# subplot 2: LoanStatus vs. ProsperRating (Alpha)
ax = plt.subplot(4, 1, 2)
sb.countplot(data = PLD, x = 'ProsperRating (Alpha)', hue = 'LoanStatus', palette = 'Blues')
ax.legend(ncol = 1) # re-arrange legend to reduce overlapping
# subplot 3: EmploymentStatus vs. IsBorrowerHomeowner, use different color palette
ax = plt.subplot(4, 1, 3)
sb.countplot(data = PLD, x = 'EmploymentStatus', hue = 'IsBorrowerHomeowner', palette = 'Greens')
ax.legend(loc = 1, ncol = 1) # re-arrange legend to remove overlapping
# subplot 4: LoanStatus vs. CurrentlyInGroup
plt.subplot(4, 1, 4)
sb.countplot(data = PLD, x = 'CurrentlyInGroup', hue = 'LoanStatus', palette = 'Blues')
plt.show()
Most loans with a Current status belong to employed borrowers and have a C ProsperRating (Alpha). Most employed borrowers own their own homes and are not currently in a group.
plt.figure(figsize = [8, 6])
plt.scatter(data = PLD, x = 'Term', y = 'EstimatedLoss' ,alpha= 1/10)
#plt.xlim([0, 3.5])
plt.xlabel('Term')
plt.xticks([12 ,36,60])
plt.yscale('log')
plt.ylabel('EstimatedLoss')
plt.show()
plt.figure(figsize = [8, 6])
plt.scatter(data = PLD, x = 'BorrowerRate', y = 'EstimatedLoss' ,alpha= 1/20)
#plt.xlim([0, 3.5])
plt.xlabel('BorrowerRate')
#plt.xticks([12 ,36,60])
#plt.yscale('log')
plt.ylabel('EstimatedLoss')
plt.show()
plt.figure(figsize = [8, 6])
sb.regplot(data = PLD, x = 'LoanOriginalAmount', y = 'BorrowerAPR', scatter_kws={'alpha':0.01});
This plot shows that the APR spans a wide range at every loan amount, but the range of APR narrows as the loan amount increases. Overall, the borrower APR is negatively correlated with the loan amount.
The borrower APR is negatively correlated with the original loan amount, which means that the larger the loan, the lower the APR. The APR also has a wide range at every loan size, but that range narrows as the loan amount rises. The Prosper rating has a clear influence on the borrower APR as well: the APR decreases as the rating improves.
The original loan amount is positively associated with the stated monthly income, which makes sense because borrowers with a higher monthly income can borrow more money. Better-rated borrowers also have a greater monthly income and larger loan amounts. There is also a relationship between Prosper rating and term: proportionally, there are more 60-month loans among B and C ratings, and borrowers with HR ratings have only 36-month loans.
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
# Term effect on relationship of APR and loan amount
F=sb.FacetGrid(data=PLD, aspect=1.2, height=5, col='Term', col_wrap=4)
F.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
F.add_legend();
Term doesn't seem to have an effect on the relationship between APR and loan amount.
# Prosper rating effect on relationship of APR and loan amount
H=sb.FacetGrid(data=PLD, aspect=1.2, height=5, col='ProsperRating (Alpha)', col_wrap=4)
H.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
H.add_legend();
Across Prosper ratings, the relationship between loan amount and borrower APR is negative for the lower ratings and turns marginally positive as the rating improves from HR to A or better. This could be because borrowers with A or AA ratings tend to borrow more money, so raising the APR may deter them from borrowing even more and maximize the benefit. Lower-rated borrowers tend to borrow less money, and a lower APR might encourage them to borrow more.
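The change in sign described above can be checked numerically by computing the correlation between loan amount and APR within each rating. The following is a sketch on hypothetical toy rows built so that the slope is negative for HR and positive for AA; `amount_apr_corr` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows: APR falls with amount for HR and rises for AA.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR'] * 4 + ['AA'] * 4,
    'LoanOriginalAmount': [1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000],
    'BorrowerAPR': [0.36, 0.35, 0.34, 0.33, 0.07, 0.08, 0.09, 0.10],
})

def amount_apr_corr(group):
    """Pearson correlation between loan amount and APR within one group."""
    return group['LoanOriginalAmount'].corr(group['BorrowerAPR'])

# One coefficient per rating.
corr_by_rating = {
    rating: amount_apr_corr(group)
    for rating, group in toy.groupby('ProsperRating (Alpha)')
}
```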
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'BorrowerAPR', hue = 'Term',
palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Borrower APR across rating and term')
plt.ylabel('Mean Borrower APR')
ax.set_yticklabels([],minor = True);
Interestingly, the borrower APR decreases as the loan term increases for borrowers with HR-C ratings. For borrowers with B-AA ratings, however, the APR increases with the loan term.
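The point plot can be complemented by a table of mean APR per rating and term. This sketch uses `pivot_table` on hypothetical toy rows for two ratings and two terms; the numbers are invented to mimic the pattern described above and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['D', 'D', 'D', 'D', 'A', 'A', 'A', 'A'],
    'Term': [36, 36, 60, 60, 36, 36, 60, 60],
    'BorrowerAPR': [0.31, 0.29, 0.28, 0.28, 0.10, 0.12, 0.13, 0.13],
})

# Rows are ratings, columns are terms, cells are mean APR.
mean_apr = toy.pivot_table(index='ProsperRating (Alpha)', columns='Term',
                           values='BorrowerAPR', aggfunc='mean')

# Positive values mean the longer term carries the higher mean APR.
term_gap = mean_apr[60] - mean_apr[36]
```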
fig, ax = plt.subplots(ncols=2, figsize=[12,6])
sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'StatedMonthlyIncome', hue = 'Term',
palette = 'Greens', linestyles = '', dodge = 0.4, ax=ax[0])
sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'LoanOriginalAmount', hue = 'Term',
palette = 'Purples', linestyles = '', dodge = 0.4, ax=ax[1]);
For loan amount, there is an interaction between term and rating: with a better Prosper rating, the loan amount increases for all three terms, and the gap in loan amount between terms also widens. For stated monthly income, there doesn't seem to be an interaction effect between term and rating; the effect of term is similar across ratings.
I extended my exploration of borrower APR against loan amount by looking at the impact of the Prosper rating. The multivariate exploration showed that the relationship between borrower APR and loan amount turns from negative to slightly positive as the Prosper rating improves from HR to AA. I then visualized the effects of rating and term on loan amount, which showed that with a better Prosper rating the loan amount increases for all three terms.
One surprising interaction is that borrower APR and loan amount are negatively correlated for Prosper ratings from HR to B, but the correlation turns positive for ratings A and AA. Another interesting finding is that the borrower APR decreases as the loan term increases for borrowers with HR-C ratings, whereas for borrowers with B-AA ratings the APR increases with the loan term.